from __future__ import absolute_import, print_function
import copy
import datetime
from math import pi, ceil
import numpy as np
import pandas as pd
import scipy as sp
from scipy.optimize import minimize
from scipy.stats import norm
from scipy.spatial.distance import mahalanobis
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Visualization
from IPython.display import Image
# Data prepping
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Predict a song’s popularity based on its audio features.
Design a personalized song recommender system based on songs’ audio features.
The Spotify dataset was provided to us by the instructors of this course.
The dataset consisted of 28 music files, one for each year from 1991 to 2018. We merged all these files in Excel to create one single file, which we named “MergedMusic”. We imported this data in our code under the name “music”.
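As a side note, the same merge could also be done directly in pandas rather than Excel. This is a hedged sketch: the file-name pattern `music_*.csv` is a hypothetical assumption, not the actual names of the 28 yearly files.

```python
import glob
import pandas as pd

def merge_yearly_files(pattern='music_*.csv'):
    """Concatenate all yearly CSVs matching `pattern` into one DataFrame."""
    frames = [pd.read_csv(f, encoding='latin-1') for f in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)
```

`sorted` keeps the yearly files in name order, and `ignore_index=True` gives the merged frame a clean 0..N-1 index.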
music=pd.read_csv('MergedMusic.csv',encoding='latin-1')
It can be seen from the distributions below that not all the features are normally distributed. Hence, we scaled the features before applying machine learning algorithms.
features = ['acousticness','danceability','energy','instrumentalness','key','liveness','loudness','mode',
'speechiness','tempo','valence']
df_features = music[features]
df_features.hist(figsize=(20,10))
plt.show()
#Target variable distribution
sns.distplot(music['popularity'])
sns.boxplot(music['popularity'])
sns.distplot(music['popularity'],bins=10,kde=False)
music['popularity'].describe()
music_features = pd.DataFrame(music[['acousticness','danceability','duration_ms','energy','instrumentalness','liveness','loudness','speechiness','tempo','valence']])
music_features.to_csv('music_features',index=False)
sns.pairplot(music_features)
music_features.corr()
music.shape
For the first objective of our project, which was to predict the popularity of a song based on its audio features, we used the music data.
Image(filename='Part_1_Approach.PNG')
The popularity column (target variable) had values ranging from 0 to 100, with a mean of 9.8. To turn prediction into a classification problem, we created two categories for this column based on the mean: values of 10 or above (above the mean) were labeled 1 (“Hit”), and the rest were labeled 0 (“Flop”). So, based on the popularity column, we created a new column in our data, popularityCategory, with two possible classes: 0 (Flop) and 1 (Hit).
# Popularity category column added (>=10 -> Hit, <10 -> Flop)
def inputPopularity(cols):
if cols>=10:
return 1
else:
return 0
music['popularityCategory'] = music['popularity'].apply(inputPopularity)
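The same labeling can also be done without `apply`, using a vectorized comparison; a tiny sketch on hypothetical popularity values:

```python
import pandas as pd

# Toy frame (made-up values) illustrating the threshold at 10.
demo = pd.DataFrame({'popularity': [0, 9, 10, 55]})
demo['popularityCategory'] = (demo['popularity'] >= 10).astype(int)
print(demo['popularityCategory'].tolist())  # [0, 0, 1, 1]
```

The vectorized form is equivalent to the `apply` version above and is typically faster on large frames.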
music_features = pd.DataFrame(music[['acousticness','danceability','duration_ms','energy','instrumentalness','liveness','loudness','speechiness','tempo','valence','popularityCategory']])
Next, we had to standardize our data. Different columns of our data correspond to different units, so to compare them we had to remove the units. To do this in a consistent way, we standardized and transformed the data to a common scale before building machine learning models.
#Standard Scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(music_features)
scaled_data = scaler.transform(music_features)
scaled_data1=scaled_data[:,[0,1,2,3,4,5,6,7,8,9]]
new_music_col_names=music_features.columns[0:10]
The new_music_data dataframe had the scaled features along with our target variable, named “popularity”. We used this dataframe for all the models discussed below.
new_music_data=pd.DataFrame(scaled_data1,columns=new_music_col_names)
new_music_data['popularity']=music['popularityCategory']
new_music_data.head()
We divided the data into training and test datasets, taking 25% of the data as the test set. This train/test split was used for all the models we applied.
from sklearn.model_selection import train_test_split
#spliting the above X and Y data into test data and training data (75% for model training and 25% for model testing)
X_train, X_test, y_train, y_test = train_test_split(new_music_data.drop('popularity',axis=1),
new_music_data['popularity'], test_size=0.25,
random_state=500)
We applied the following machine learning algorithms:
- Logistic Regression
- Bayesian Logistic Regression
- SVM
- Decision Tree
- Random Forest
- KNN
Along with running the basic model/algorithm on our data, we tuned most of the models using GridSearchCV and RandomizedSearchCV.
The score/metric we tried to maximize is the accuracy score: the mean accuracy, i.e. the fraction of songs for which the predicted label is the same as the true label.
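Concretely, the accuracy score is just the fraction of matching labels; a toy illustration with made-up labels:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
# 4 of the 5 predicted labels match the true labels, so accuracy is 0.8
print(accuracy_score(y_true, y_pred))  # 0.8
```

This is the same quantity returned by each classifier's `.score(X_test, y_test)` method used throughout this section.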
Image(filename='Result_Part_1.jpeg')
Logistic regression applies a logistic (log-odds) transformation to the outcome variable, allowing us to model a non-linear relationship in a linear way. Although a simple model, it is very powerful and is generally the starting point for a binary classification problem. We used this model because our features are not normally distributed, and this algorithm is relatively insensitive to the normality of the independent variables.
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
confusion_matrix(y_test, predictions)
print(classification_report(y_test,predictions))
#Accuracy score of logistic regression (mean accuracy which is same as best score)
logmodel.score(X_test,y_test)
The basic logistic model gave us an accuracy score of 0.6266. To try to improve it, we ran GridSearchCV (without specifying C), GridSearchCV (with specified values for C), and RandomizedSearchCV.
from sklearn.model_selection import GridSearchCV
#Grid search without specifying C
dual=[True,False]
max_iter=[100,110,120,130,140]
param_grid = dict(dual=dual,max_iter=max_iter)
## Didn't work out well
import time
lr = LogisticRegression(penalty='l2')
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv = 3, n_jobs=-1)
start_time = time.time()
grid_result = grid.fit(X_train, y_train)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')
#Grid search with C specified (results are not better than the original model; execution time has also increased)
dual=[True,False]
max_iter=[100,110,120,130,140]
C = [1.0,1.5,2.0,2.5]
param_grid = dict(dual=dual,max_iter=max_iter,C=C)
#did not improve the accuracy
lr = LogisticRegression(penalty='l2')
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv = 3, n_jobs=-1)
start_time = time.time()
grid_result = grid.fit(X_train, y_train)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')
from sklearn.model_selection import RandomizedSearchCV
# Accuracy does not improve, but execution time is less than grid search.
random = RandomizedSearchCV(estimator=lr, param_distributions=param_grid, cv = 3, n_jobs=-1)
start_time = time.time()
random_result = random.fit(X_train, y_train)
# Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))
print("Execution time: " + str((time.time() - start_time)) + ' s')
All three searches, run to tune the original model, improved the accuracy score only slightly, and GridSearchCV with C specified was computationally expensive (it took a long time). Based on the results, we concluded that GridSearchCV without specifying C gave the highest accuracy: 0.6278.
Best_Accuracy_Logistic = grid_result.best_score_
Best_Accuracy_Logistic
Bayesian logistic regression here uses a Laplace (Gaussian) approximation for the posterior distribution of the fitted parameters. As such, it is much faster than using MCMC. It requires numpy and scipy to work.
Briefly, Bayesian logistic regression works by defining a Gaussian prior on the parameter vector $w$ such that $p(w) = \mathcal{N}(w_0, \Sigma_0)$. Crucially, $w_0$ need not be zero (the prior mean may be non-zero) and $\Sigma_0$ is an arbitrary covariance matrix. Note that if $w_0 = 0$ and $\Sigma_0$ were diagonal with identical values, the problem would reduce to L2-regularized logistic regression.
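To make that reduction explicit: with $w_0 = 0$ and $\Sigma_0 = \lambda^{-1} I$, the prior penalty term in the objective becomes

$ \frac{1}{2}(w - w_0)^T \Sigma_0^{-1} (w - w_0) = \frac{\lambda}{2} w^T w $

which is exactly the L2 (ridge) regularization penalty on the weights.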
Inclusion of this prior imposes a penalization upon the log likelihood of any data $\mathcal{D}$ we observe. One then has to minimize the negative log posterior which is equal to the following objective up to a constant.
$ f(w|\mathcal{D}) = - \sum_{i=1}^N \{ y_i \log(p_i) + (1-y_i)\log(1-p_i) \} + \frac{1}{2}(w - w_0)^T \Sigma_0^{-1} (w -w_0)$
where $p_i = e^{x_i^T w} /(1 + e^{x_i^T w})$ is the standard logistic probability for sample $x_i$. This can be minimized by any gradient- (or Hessian-) based solver; here we use scipy.optimize.minimize.
Under the Laplace approximation, the posterior over the parameter vector $w$ is given by a normal distribution $\mathcal{N}(\hat w, H^{-1})$, where $H = \nabla_w^2 f(w|\mathcal{D}) |_{\hat w}$ is the Hessian of the negative penalized log likelihood (negative log posterior) evaluated at the MAP estimate $\hat w$.
$H$ is the same Hessian function that would be used for a Hessian-based optimization (such as Newton-CG) of $f(w|\mathcal{D})$. Even if we use a gradient-based optimization (such as BFGS), we still need to compute the Hessian to evaluate the posterior distribution. Of course, one can use a diagonal approximation to the Hessian (and we allow for this) if there are a large number of parameters.
Using the full (Laplace) posterior of $w$, one can calculate the full posterior predictive, e.g. the probability $p(y|x,\mathcal{D})$, which is a moderated (relaxed towards 0.5) version of $p_i$ (the standard logistic probability). This takes into account the uncertainty in our estimate of $w$.
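Concretely, writing $\mu_a = x^T \hat w$ and $\sigma_a^2 = x^T H^{-1} x$ for a test point $x$, the probit-approximated posterior predictive implemented later in `bayes_logistic_prob` is

$ p(y=1|x,\mathcal{D}) \approx \frac{1}{1 + e^{-\kappa \mu_a}}, \qquad \kappa = \left(1 + \frac{\pi \sigma_a^2}{8}\right)^{-1/2} $

so a larger posterior variance $\sigma_a^2$ shrinks $\kappa$ and pulls the probability towards 0.5.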
After plain logistic regression, we used Bayesian logistic regression. Logistic regression uses maximum likelihood estimation, which treats the parameters as unknown but fixed, whereas Bayesian statistics treats the unknown parameters as random variables and quantifies their uncertainty with probability. We used Bayesian logistic regression so that we could update our beliefs as new information came in while retaining what we had already learned.
y=new_music_data.values[:, 10]
X = np.delete(new_music_data.values, 10, 1)
n_samples, n_features = X.shape
# Add bias column to the feature matrix
B = np.ones((n_samples, n_features + 1))
B[:, 1:] = X
X = B
def logistic_prob(X, w):
""" MAP (Bayes point) logistic regression probability with overflow prevention via exponent truncation
Parameters
----------
X : array-like, shape (N, p)
Feature matrix
w : array-like, shape (p, )
Parameter vector
Returns
-------
pr : array-like, shape (N, )
vector of logistic regression probabilities
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# set a truncation exponent.
trunc = 8. # exp(8)/(1+exp(8)) = 0.9997 which is close enough to 1 as to not matter in most cases.
# calculate argument of logit
z = np.dot(X, w)
# truncate to avoid numerical over/underflow
z = np.clip(z, -trunc, trunc)
#print('z',z)
# calculate logitstic probability
pr = np.exp(z)
pr = pr / (1. + pr)
#print('pr',pr)
return pr
def f_log_posterior(w, wprior, H, y, X, weights=None):
"""Returns negative log posterior probability.
Parameters
----------
w : array-like, shape (p, )
vector of parameters at which the negative log posterior is to be evaluated
wprior : array-like, shape (p, )
vector of prior means on the parameters to be fit
H : array-like, shape (p, p) or (p, )
Array of prior Hessian (inverse covariance of prior distribution of parameters)
y : array-like, shape (N, )
vector of binary ({0,1} responses)
X : array-like, shape (N, p)
array of features
weights : array-like, shape (N, )
vector of data point weights. Should be within [0,1]
Returns
-------
neg_log_post : float
negative log posterior probability
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# fill in weights if need be
if weights is None:
weights = np.ones(len(np.atleast_1d(y)), )
if len(np.atleast_1d(weights)) != len(np.atleast_1d(y)):
raise ValueError(' weight vector must be same length as response vector')
# calculate negative log posterior
eps = 1e-6 # this defined to ensure that we never take a log of zero
mu = logistic_prob(X, w)
if len(H.shape) == 2:
neg_log_post = (- (np.dot(y.T, weights * np.log(mu + eps))
+ np.dot((1. - y).T, weights * np.log(1. - mu + eps)))
+ 0.5 * np.dot((w - wprior).T, np.dot(H, (w - wprior))))
elif len(H.shape) == 1:
neg_log_post = (- (np.dot(y.T, weights * np.log(mu + eps))
+ np.dot((1. - y).T, weights * np.log(1. - mu + eps)))
+ 0.5 * np.dot((w - wprior).T, H * (w - wprior)))
else:
raise ValueError('Incompatible Hessian')
#print('neg_log_post',neg_log_post)
return float(neg_log_post)
def g_log_posterior(w, wprior, H, y, X, weights=None):
"""Returns gradient of the negative log posterior probability.
Parameters
----------
w : array-like, shape (p, )
parameter vector at which the gradient is to be evaluated
wprior : array-like, shape (p, )
array of prior means on the parameters to be fit
H : array-like, shape (p, p) or (p, )
array of prior Hessian (inverse covariance of prior distribution of parameters)
y : array-like, shape (N, )
array of binary ({0,1} responses)
X : array-like, shape (N, p)
array of features
weights : array-like, shape (N, )
array of data point weights. Should be within [0,1]
Returns
-------
grad_log_post : array-like, shape (p, )
gradient of negative log posterior
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# fill in weights if need be
if weights is None:
weights = np.ones(len(np.atleast_1d(y)), )
if len(np.atleast_1d(weights)) != len(np.atleast_1d(y)):
raise ValueError(' weight vector must be same length as response vector')
# calculate gradient
mu_ = logistic_prob(X, w)
if len(H.shape) == 2:
grad_log_post = np.dot(X.T, weights * (mu_ - y)) + np.dot(H, (w - wprior))
elif len(H.shape) == 1:
grad_log_post = np.dot(X.T, weights * (mu_ - y)) + H * (w - wprior)
else:
raise ValueError('Incompatible Hessian')
#print('grad_log_post',grad_log_post)
return grad_log_post
def g_log_posterior_small(w, wprior, H, y, X, weights=None):
"""Returns normalized (to 1) gradient of the negative log posterior probability.
This is used for BFGS and L-BFGS-B solvers which tend to not converge unless
the gradient is normalized.
Parameters
----------
w : array-like, shape (p, )
parameter vector at which the gradient is to be evaluated
wprior : array-like, shape (p, )
array of prior means on the parameters to be fit
H : array-like, shape (p, p) or (p, )
array of prior Hessian (inverse covariance of prior distribution of parameters)
y : array-like, shape (N, )
array of binary ({0,1} responses)
X : array-like, shape (N, p)
array of features
weights : array-like, shape (N, )
array of data point weights. Should be within [0,1]
Returns
-------
grad_log_post : array-like, shape (p, )
normalized (to 1) gradient of negative log posterior
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# fill in weights if need be
if weights is None:
weights = np.ones(len(np.atleast_1d(y)), )
if len(np.atleast_1d(weights)) != len(np.atleast_1d(y)):
raise ValueError(' weight vector must be same length as response vector')
# calculate gradient
mu = logistic_prob(X, w)
if len(H.shape) == 2:
grad_log_post = np.dot(X.T, weights * (mu - y)) + np.dot(H, (w - wprior))
elif len(H.shape) == 1:
grad_log_post = np.dot(X.T, weights * (mu - y)) + H * (w - wprior)
else:
raise ValueError('Incompatible Hessian')
# normalize gradient to length 1
grad_log_post = grad_log_post / np.sqrt(np.sum(grad_log_post * grad_log_post))
return grad_log_post
def H_log_posterior(w, wprior, H, y, X, weights=None):
"""Returns Hessian (either full or diagonal) of the negative log posterior probability.
Parameters
----------
w : array-like, shape (p, )
parameter vector at which the Hessian is to be evaluated
wprior : array-like, shape (p, )
array of prior means on the parameters to be fit
H : array-like, shape (p, p) or (p, )
array of log prior Hessian (inverse covariance of prior distribution of parameters)
y : array-like, shape (N, )
array of binary ({0,1} responses)
X : array-like, shape (N, p)
array of features
weights : array-like, shape (N, )
array of data point weights. Should be within [0,1]
Returns
-------
H_log_post : array-like, shape like `H`
Hessian of negative log posterior
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# fill in weights if need be
if weights is None:
weights = np.ones(len(np.atleast_1d(y)), )
if len(np.atleast_1d(weights)) != len(np.atleast_1d(y)):
raise ValueError(' weight vector must be same length as response vector')
# calculate log posterior Hessian
mu = logistic_prob(X, w)
S = mu * (1. - mu) * weights
if len(H.shape) == 2:
H_log_post = np.dot(X.T, X * S[:, np.newaxis]) + H
elif len(H.shape) == 1:
H_log_post = np.diag(np.dot(X.T, X * S[:, np.newaxis])) + H
else:
raise ValueError('Incompatible Hessian')
return H_log_post
def HP_log_posterior(w, q, wprior, H, y, X, weights=None):
"""Returns diagonal Hessian of the negative log posterior probability multiplied by an arbitrary vector.
This is useful for the Newton-CG solver, particularly when we only want to store a diagonal Hessian.
Parameters
----------
w : array-like, shape (p, )
parameter vector at which the Hessian is to be evaluated
q : array-like, shape (p, )
arbitrary vector to multiply Hessian by
wprior : array-like, shape (p, )
array of prior means on the parameters to be fit
H : array-like, shape (p, )
array of diagonal log prior Hessian (inverse covariance of prior distribution of parameters)
y : array-like, shape (N, )
array of binary ({0,1} responses)
X : array-like, shape (N, p)
array of features
weights : array-like, shape (N, )
array of data point weights. Should be within [0,1]
Returns
-------
HP : array-like, shape (p, )
Hessian of log posterior (diagonal approx) multiplied by arbitrary vector
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# fill in weights if need be
if weights is None:
weights = np.ones(len(np.atleast_1d(y)), )
if len(np.atleast_1d(weights)) != len(np.atleast_1d(y)):
raise ValueError(' weight vector must be same length as response vector')
HP = H_log_posterior(w, wprior, H, y, X, weights)
HP = HP * q
return HP
def fit_bayes_logistic(y, X, wprior, H, weights=None, solver='Newton-CG', bounds=None, maxiter=100):
""" Bayesian Logistic Regression Solver. Assumes Laplace (Gaussian) Approximation
to the posterior of the fitted parameter vector. Uses scipy.optimize.minimize
Parameters
----------
y : array-like, shape (N, )
array of binary {0,1} responses
X : array-like, shape (N, p)
array of features
wprior : array-like, shape (p, )
array of prior means on the parameters to be fit
H : array-like, shape (p, p) or (p, )
array of prior Hessian (inverse covariance of prior distribution of parameters)
weights : array-like, shape (N, )
array of data point weights. Should be within [0,1]
solver : string
scipy optimize solver used. this should be either 'Newton-CG', 'BFGS' or 'L-BFGS-B'.
The default is Newton-CG.
bounds : iterable of length p
a length p list (or tuple) of tuples each of length 2.
This is only used if the solver is set to 'L-BFGS-B'. In that case, a tuple
(lower_bound, upper_bound), both floats, is defined for each parameter. See the
scipy.optimize.minimize docs for further information.
maxiter : int
maximum number of iterations for scipy.optimize.minimize solver.
Returns
-------
w_fit : array-like, shape (p, )
posterior parameters (MAP estimate)
H_fit : array-like, shape like `H`
posterior Hessian (Hessian of negative log posterior evaluated at MAP parameters)
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# Check that dimensionality of inputs agrees
# check X
if len(X.shape) != 2:
raise ValueError('X must be a N*p matrix')
(nX, pX) = X.shape
# check y
if len(y.shape) > 1:
raise ValueError('y must be a vector of shape (p, )')
if len(np.atleast_1d(y)) != nX:
raise ValueError('y and X do not have the same number of rows')
# check wprior
if len(wprior.shape) > 1:
raise ValueError('prior should be a vector of shape (p, )')
if len(np.atleast_1d(wprior)) != pX:
raise ValueError('prior mean has incompatible length')
# check H
if len(H.shape) == 1:
if np.atleast_1d(H).shape[0] != pX:
raise ValueError('prior Hessian is diagonal but has incompatible length')
elif len(H.shape) == 2:
(h1,h2) = np.atleast_2d(H).shape
if h1 != h2:
raise ValueError('prior Hessian must either be a p*p square matrix or a vector or shape (p, ) ')
if h1 != pX:
raise ValueError('prior Hessian is square but has incompatible size')
# fill in weights if need be
if weights is None:
weights = np.ones(len(np.atleast_1d(y)), )
if len(np.atleast_1d(weights)) != len(np.atleast_1d(y)):
raise ValueError(' weight vector must be same length as response vector')
# Do the regression
if solver == 'Newton-CG':
if len(H.shape) == 2:
ww = minimize(f_log_posterior, wprior, args=(wprior, H, y, X, weights), jac=g_log_posterior,
hess=H_log_posterior, method='Newton-CG', options={'maxiter': maxiter})
w_fit = ww.x
H_fit = H_log_posterior(w_fit, wprior, H, y, X, weights)
elif len(H.shape) == 1:
ww = minimize(f_log_posterior, wprior, args=(wprior, H, y, X, weights), jac=g_log_posterior,
hessp=HP_log_posterior, method='Newton-CG', options={'maxiter': maxiter})
w_fit = ww.x
H_fit = H_log_posterior(w_fit, wprior, H, y, X, weights)
else:
raise ValueError(' You must either use the full Hessian or its diagonal as a vector')
elif solver == 'BFGS':
ww = minimize(f_log_posterior, wprior, args=(wprior, H, y, X, weights), jac=g_log_posterior_small,
method='BFGS', options={'maxiter': maxiter})
w_fit = ww.x
H_fit = H_log_posterior(w_fit, wprior, H, y, X, weights)
elif solver == 'L-BFGS-B':
ww = minimize(f_log_posterior, wprior, args=(wprior, H, y, X, weights), jac=g_log_posterior_small,
method='L-BFGS-B', bounds=bounds, options={'maxiter': maxiter})
w_fit = ww.x
H_fit = H_log_posterior(w_fit, wprior, H, y, X, weights)
else:
raise ValueError('Unknown solver specified: "{0}"'.format(solver))
print("Sampled w:", ww.x)
#print(ww.shape)
return w_fit, H_fit
def bayes_logistic_prob(X, w, H):
""" Posterior predictive logistic regression probability. Uses probit approximation
to the logistic regression sigmoid. Also has overflow prevention via exponent truncation.
Parameters
----------
X : array-like, shape (N, p)
array of covariates
w : array-like, shape (p, )
array of fitted MAP parameters
H : array-like, shape (p, p) or (p, )
array of log posterior Hessian (covariance matrix of fitted MAP parameters)
Returns
-------
pr : array-like, shape (N, )
moderated (by full distribution) logistic probability
References
----------
Chapter 8 of Murphy, K. 'Machine Learning a Probabilistic Perspective', MIT Press (2012)
Chapter 4 of Bishop, C. 'Pattern Recognition and Machine Learning', Springer (2006)
"""
# set a truncation exponent
trunc = 8. # exp(8)/(1+exp(8)) = 0.9997 which is close enough to 1 as to not matter in most cases.
# unmoderated argument of exponent
z_a = np.dot(X, w)
# find the moderation
if len(H.shape) == 2:
H_inv_ = np.linalg.inv(H)
sig2_a = np.sum(X * np.dot(H_inv_, X.T).T, axis=1)
elif len(H.shape) == 1:
H_inv_ = 1. / H
sig2_a = np.sum(X * (H_inv_ * X), axis=1)
else:
raise ValueError(' You must either use the full Hessian or its diagonal as a vector')
print('H inverse',H_inv_)
print('sig2_a',sig2_a)
# get the moderation factor. Implicit in here is approximating the logistic sigmoid with
# a probit by setting the probit and sigmoid slopes to be equal at the origin. This is where
# the factor of pi/8 comes from.
kappa_sig2_a = 1. / np.sqrt(1. + 0.125 * np.pi * sig2_a)
# calculate the moderated argument of the logit
z = z_a * kappa_sig2_a
# do a truncation to prevent exp overflow
z = np.clip(z, -trunc, trunc)
# get the moderated logistic probability
pr = np.exp(z)
pr = pr / (1. + pr)
return pr
For Bayesian logistic regression, we used three solvers: Newton-CG, BFGS, and L-BFGS-B. This train-test split was used for all three solvers.
# The data is divided into training and test sets
training_X, test_X, training_y, test_y = train_test_split(X,y, test_size=0.25, random_state=500)
We used the scipy.optimize.minimize function, which minimizes a scalar function of one or more variables, with method='Newton-CG'.
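As a minimal illustration of how `scipy.optimize.minimize` is called with `method='Newton-CG'` (a toy quadratic objective, not the project's log-posterior):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(w) = ||w - c||^2, whose minimum is at w = c.
c = np.array([1.0, -2.0])
f = lambda w: np.sum((w - c) ** 2)
grad = lambda w: 2.0 * (w - c)          # analytic gradient (jac), required by Newton-CG
hess = lambda w: 2.0 * np.eye(len(w))   # analytic Hessian

res = minimize(f, x0=np.zeros(2), jac=grad, hess=hess, method='Newton-CG')
print(res.x)  # close to [ 1. -2.]
```

`fit_bayes_logistic` above passes `f_log_posterior`, `g_log_posterior`, and `H_log_posterior` in exactly these roles (`fun`, `jac`, `hess`/`hessp`).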
# Train the model for Newton CG
w_prior = np.zeros(n_features + 1)
H_prior = np.diag(np.ones(n_features + 1))*0.001
GD_BATCH_SIZE = len(training_X)
ITERATION_CNT = 10
w = training_X.shape[1]
w_prior = np.zeros(w)
H_prior = np.diag(np.ones(w))*0.001
for i in range(0, ITERATION_CNT):
for idx in range(0, len(training_X), GD_BATCH_SIZE):
batch_size = GD_BATCH_SIZE if (idx + GD_BATCH_SIZE) < len(training_X) else len(training_X) - idx
print("batch_size",batch_size)
print("GD_BATCH_SIZE",GD_BATCH_SIZE)
        w_posterior, H_posterior = fit_bayes_logistic(training_y[idx:idx+batch_size],
                                                      training_X[idx:idx+batch_size,:],
                                                      w_prior, H_prior)
w_prior = copy.copy(w_posterior)
H_prior = copy.copy(H_posterior)
w_fit = w_prior
H_fit = H_prior
The training loop printed 10 fitted (“sampled”) parameter vectors $w$, one per iteration, with each iteration's posterior carried forward as the next prior.
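Given a fitted MAP vector and posterior Hessian, actual draws from the Laplace posterior $\mathcal{N}(w_{fit}, H_{fit}^{-1})$ can be generated as below. This is a sketch with a hypothetical 2-parameter posterior, not the values fitted above.

```python
import numpy as np

rng = np.random.default_rng(0)
w_fit_demo = np.array([0.5, -1.0])      # hypothetical MAP estimate
H_fit_demo = np.array([[4.0, 0.0],      # hypothetical posterior Hessian
                       [0.0, 9.0]])
cov = np.linalg.inv(H_fit_demo)         # posterior covariance is H^{-1}
samples = rng.multivariate_normal(w_fit_demo, cov, size=10)
print(samples.shape)  # (10, 2)
```

Such samples could be used, for instance, to Monte-Carlo the posterior predictive instead of the probit approximation used in `bayes_logistic_prob`.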
# Perform Test for Newton CG
test_probs = bayes_logistic_prob(test_X, w_fit, H_fit)
df_Newton=pd.DataFrame(test_probs,columns=["y_prob"])
df_Newton["y"]=test_y
def PredictionPopularity(cols):
if cols>0.5:
return 1
else:
return 0
df_Newton['y_hat'] = df_Newton["y_prob"].apply(PredictionPopularity)
df_Newton.head()
print(confusion_matrix(df_Newton["y"],df_Newton["y_hat"]))
print(classification_report(df_Newton["y"],df_Newton["y_hat"]))
#Mean accuracy from confusion matrix
print("Accuracy:", (36890+4640)/66280)
We used the scipy.optimize.minimize function, which minimizes a scalar function of one or more variables, with method='BFGS'.
#Training the model
w_prior = np.zeros(n_features + 1)
H_prior = np.diag(np.ones(n_features + 1))*0.001
GD_BATCH_SIZE = len(training_X)
ITERATION_CNT = 10
w = training_X.shape[1]
w_prior = np.zeros(w)
H_prior = np.diag(np.ones(w))*0.001
for i in range(0, ITERATION_CNT):
for idx in range(0, len(training_X), GD_BATCH_SIZE):
batch_size = GD_BATCH_SIZE if (idx + GD_BATCH_SIZE) < len(training_X) else len(training_X) - idx
print("batch_size",batch_size)
print("GD_BATCH_SIZE",GD_BATCH_SIZE)
        w_posterior, H_posterior = fit_bayes_logistic(training_y[idx:idx+batch_size],
                                                      training_X[idx:idx+batch_size,:],
                                                      w_prior, H_prior, solver="BFGS")
w_prior = copy.copy(w_posterior)
H_prior = copy.copy(H_posterior)
w_fit_BFGS = w_prior
H_fit_BFGS = H_prior
Again, 10 fitted parameter vectors $w$ were printed during training.
# Perform Test for BFGS
test_probs = bayes_logistic_prob(test_X, w_fit_BFGS, H_fit_BFGS)
df_BFGS=pd.DataFrame(test_probs,columns=["y_prob"])
df_BFGS["y"]=test_y
df_BFGS['y_hat'] = df_BFGS["y_prob"].apply(PredictionPopularity)
df_BFGS.head()
print(confusion_matrix(df_BFGS["y"],df_BFGS["y_hat"]))
print(classification_report(df_BFGS["y"],df_BFGS["y_hat"]))
#Mean accuracy from confusion matrix
print("Accuracy:", (36890+4640)/66280)
We used the scipy.optimize.minimize function, which minimizes a scalar function of one or more variables, with method='L-BFGS-B'.
# fit via L-BFGS-B so we can impose bounds
# first make a list of tuples with the bounds for each parameter in it
# (see the scipy.optimize.minimize docs)
bnd = (-0.25,0.25)
bnd_list = [(None,None)]
for _ in np.arange(11-1):
bnd_list.append(bnd)
# Train the model
w_prior = np.zeros(n_features + 1)
H_prior = np.diag(np.ones(n_features + 1))*0.001
GD_BATCH_SIZE = len(training_X)
ITERATION_CNT = 10
w = training_X.shape[1]
w_prior = np.zeros(w)
H_prior = np.diag(np.ones(w))*0.001
for i in range(0, ITERATION_CNT):
for idx in range(0, len(training_X), GD_BATCH_SIZE):
batch_size = GD_BATCH_SIZE if (idx + GD_BATCH_SIZE) < len(training_X) else len(training_X) - idx
print("batch_size",batch_size)
print("GD_BATCH_SIZE",GD_BATCH_SIZE)
        w_posterior, H_posterior = fit_bayes_logistic(training_y[idx:idx+batch_size],
                                                      training_X[idx:idx+batch_size,:],
                                                      w_prior, H_prior, solver='L-BFGS-B', bounds = bnd_list)
w_prior = copy.copy(w_posterior)
H_prior = copy.copy(H_posterior)
w_fit_LBFGSB = w_prior
H_fit_LBFGSB = H_prior
Again, 10 fitted parameter vectors $w$ were printed during training.
# Perform Test for LBFGSB
test_probs = bayes_logistic_prob(test_X, w_fit_LBFGSB, H_fit_LBFGSB)
df_LBFGSB=pd.DataFrame(test_probs,columns=["y_prob"])
df_LBFGSB["y"]=test_y
df_LBFGSB['y_hat'] = df_LBFGSB["y_prob"].apply(PredictionPopularity)
df_LBFGSB.head()
print(confusion_matrix(df_LBFGSB["y"],df_LBFGSB["y_hat"]))
print(classification_report(df_LBFGSB["y"],df_LBFGSB["y_hat"]))
#Mean accuracy from confusion matrix
print("Accuracy:", (36890+4640)/66280)
#Mean accuracy from confusion matrix
Best_Accuracy_BayesianLogistic = (36890+4640)/66280
Best_Accuracy_BayesianLogistic
plt.plot(w_fit_BFGS,'rs-', label = 'BFGS')
plt.plot(w_fit_LBFGSB,'gs-', label = 'L-BFGS-B with imposed bounds')
plt.plot(w_fit,'ks-', label = 'Newton-CG')
plt.plot((0, 10), (0.25, 0.25), 'k--')
plt.plot((0, 10), (-0.25, -0.25), 'k--')
plt.ylim([-1.5,1.0])
plt.legend(loc = 'lower right')
plt.title('Posterior Parameters')
plt.subplot(1,3,1)
plt.imshow(H_fit)
plt.title('Newton')
plt.subplot(1,3,2)
plt.imshow(H_fit_BFGS)
plt.title('BFGS')
plt.subplot(1,3,3)
plt.imshow(H_fit_LBFGSB)
plt.title('L-BFGS-B')
After logistic regression (LR), we applied SVM. SVM minimizes the hinge loss while logistic regression minimizes the logistic loss. LR is more sensitive to outliers than SVM because the logistic loss penalizes every point, however confidently it is classified, whereas the hinge loss is exactly zero for points beyond the margin. As our data had outliers, we wanted to see whether their presence impacted the performance of LR on our data. This was the reason why we chose to apply SVM after LR.
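The two losses can be compared directly as functions of the margin $m = y \cdot f(x)$ (with $y \in \{-1,+1\}$); a small sketch:

```python
import numpy as np

def hinge_loss(m):
    """SVM hinge loss: zero once the margin exceeds 1."""
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):
    """Logistic loss: smooth and strictly positive for every finite margin."""
    return np.log1p(np.exp(-m))

# A confidently correct point (margin 3): hinge is exactly zero,
# logistic loss is small but never reaches zero.
print(hinge_loss(3.0))   # 0.0
print(logistic_loss(3.0))
```

This is why every point, even a well-classified one, pulls on the logistic regression solution, while SVM ignores points outside the margin.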
from sklearn.svm import SVC
#SVM – linear kernel
SVM_Linear = SVC(kernel='linear',random_state=0)
SVM_Linear.fit(X_train,y_train)
predictions = SVM_Linear.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
#SVM – default kernel
SVM_RBF = SVC(probability= True,random_state=0)
SVM_RBF.fit(X_train,y_train)
predictions2 = SVM_RBF.predict(X_test)
print(confusion_matrix(y_test,predictions2))
print(classification_report(y_test,predictions2))
For the basic SVM model, we ran both the default (RBF) and linear kernels. The default kernel gave the higher accuracy score of 0.6410. To try to improve the score further, we used RandomizedSearchCV; we did not use GridSearchCV, as it is computationally very expensive for SVM.
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats
from sklearn.metrics import make_scorer, roc_auc_score
auc = make_scorer(roc_auc_score)
rand_list = {"C": stats.uniform(2, 10),
"gamma": stats.uniform(0.1, 1)}
rand_search = RandomizedSearchCV(SVC(), param_distributions = rand_list, n_iter = 5, n_jobs = 4, cv = 2, random_state = 0, scoring = auc)
rand_search.fit(X_train_DT,y_train)
rand_search.cv_results_
rand_search.best_score_
RandomizedSearchCV did not improve the accuracy score, so we concluded that the base SVM model with the default kernel gave us the highest accuracy.
#RandomizedSearchCV did not help improve the accuracy
Best_Accuracy_SVM= SVM_RBF.score(X_test,y_test)
Best_Accuracy_SVM
Next, we applied a Decision Tree (D-Tree). A decision tree quantifies the value and probability of each possible outcome, and it is easy to implement, understand, and visualize. We used this model to see whether a D-Tree could accurately predict the target class from all possible outcomes.
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
predictions = dtree.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
The base Decision Tree model gave us an accuracy score of 0.6327549788774894. Because decision trees suffer from overfitting, we next tried to address this by running the Random Forest algorithm on our data.
Best_Accuracy_DTree=dtree.score(X_test,y_test)
Best_Accuracy_DTree
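The overfitting tendency mentioned above can be illustrated on synthetic data (this sketch uses sklearn's make_classification, not our Spotify split): an unpruned tree memorizes the training set, while limiting the depth narrows the train/test gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; our Spotify train/test split is not reproduced here.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # unpruned
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# The unpruned tree fits the training set perfectly, so a large gap between
# its train and test accuracy signals overfitting.
print('deep    train/test:', deep.score(X_tr, y_tr), round(deep.score(X_te, y_te), 3))
print('shallow train/test:', round(shallow.score(X_tr, y_tr), 3), round(shallow.score(X_te, y_te), 3))
```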
As a single D-Tree can suffer from overfitting, we next ran an ensemble method, Random Forest, on our data. Random Forest builds multiple decision trees and merges their predictions to obtain a more accurate and stable result. Because it de-correlates the trees, the resulting ensemble has lower variance and is more reliable, largely eliminating the overfitting generally experienced with a single decision tree.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print(confusion_matrix(y_test,rfc_pred))
print(classification_report(y_test,rfc_pred))
rfc.score(X_test,y_test)
The base Random Forest model gave us an accuracy score of 0.7002564876282438. To improve it further, we used RandomizedSearchCV; we did not use GridSearchCV as it is computationally very expensive for a Random Forest model.
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
# Use the random grid to search for the best hyperparameters
# Random search of parameters, using 3-fold cross-validation,
# across n_iter = 3 random combinations, using all available cores
rf_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid, n_iter = 3, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train,y_train)
rf_random.score(X_test,y_test)
Although RandomizedSearchCV produced results, it took a long time to run. To address this, we performed feature selection and kept the 5 most important features.
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X_train,y_train)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()
Using these 5 features, we re-ran Random Forest and RandomizedSearchCV.
X_train_DT = X_train[["loudness","acousticness","energy","instrumentalness","duration_ms"]]
X_test_DT= X_test[["loudness","acousticness","energy","instrumentalness","duration_ms"]]
rfc_reduced = RandomForestClassifier(n_estimators=100)
rfc_reduced.fit(X_train_DT,y_train)
rfc_pred_reduced = rfc_reduced.predict(X_test_DT)
print(confusion_matrix(y_test,rfc_pred_reduced))
print(classification_report(y_test,rfc_pred_reduced))
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
print(random_grid)
# Use the random grid to search for the best hyperparameters
# Random search of parameters, using 3-fold cross-validation,
# across n_iter = 3 random combinations, using all available cores
rf_random_reduced = RandomizedSearchCV(estimator = rfc_reduced, param_distributions = random_grid, n_iter = 3, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random_reduced.fit(X_train_DT,y_train)
print(rf_random_reduced.best_score_)
print(rf_random_reduced.best_params_)
print(rf_random_reduced.best_estimator_)
Comparing the accuracy score of the base model against the tuning methods we tried, the best accuracy, 0.7048129149064575, came from the model tuned with RandomizedSearchCV on all features.
Best_Accuracy_RForest = rf_random.score(X_test,y_test)
#Using RandomizedSearchCV on all the features, our accuracy improved
Finally, we applied KNN to our data because it classifies by feature similarity and is effective with large training sets.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
We initialized the model with k=1. To find the optimal k, we plotted the misclassification (error) rate against values of k from 1 to 30. Based on the graph, we chose k=20, as it gave the lowest error rate. k=24 gave a similar rate, but the difference was too small to justify the larger k.
error_rate = []
for i in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1,30),error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.xlim(0,30)
# NOW WITH K=20
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print('WITH K=20')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
knn.score(X_test,y_test)
The base KNN model (k=20) gave us an accuracy score of 0.6401176825588413. To improve it further, we ran GridSearchCV and RandomizedSearchCV.
from sklearn.model_selection import GridSearchCV
#create new a knn model
knn2 = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)
#fit model to data
knn_gscv.fit(X_train, y_train)
#check top performing n_neighbors value
knn_gscv.best_params_
#check mean score for the top performing value of n_neighbors
knn_gscv.best_score_
# specify "parameter distributions" rather than a "parameter grid"
k_range = list(range(1, 30))
weight_options = ['uniform', 'distance']
# since both parameters are discrete, param_dist has the same form as param_grid
param_dist = dict(n_neighbors=k_range, weights=weight_options)
# if parameters are continuous (like regularization)
# n_iter controls the number of searches
# instantiate model
# 2 new params
# n_iter --> controls number of random combinations it will try
# random_state for reproducibility
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
# fit
rand.fit(X_train, y_train)
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
print(rand.best_estimator_)
Based on the accuracy scores, we concluded that RandomizedSearchCV gave the best accuracy for KNN.
Best_Accuracy_KNN=rand.best_score_
For the first objective of the project, we wanted to evaluate all the models together and see which one outperformed the others. For this, we used two methods: an ROC/AUC curve and a bar chart of the best accuracy score obtained from each model.
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.metrics import roc_curve
y_pred_proba_Logistic = grid_result.predict_proba(X_test)[::,1]
fpr_Logistic, tpr_Logistic, _ = roc_curve(y_test, y_pred_proba_Logistic)
auc_logistic = round(roc_auc_score(y_test, y_pred_proba_Logistic),3)
y_pred_proba_DTree=dtree.predict_proba(X_test)[::,-1]
fpr_DTree, tpr_DTree, _ = roc_curve(y_test, y_pred_proba_DTree)
auc_DTree = round(roc_auc_score(y_test, y_pred_proba_DTree),4)
y_pred_proba_RForest=rf_random.predict_proba(X_test)[::,-1]
fpr_RForest, tpr_RForest, _ = roc_curve(y_test, y_pred_proba_RForest)
auc_RForest = round(roc_auc_score(y_test, y_pred_proba_RForest),4)
y_pred_proba_KNN=knn.predict_proba(X_test)[::,-1]
fpr_KNN, tpr_KNN, _ = roc_curve(y_test, y_pred_proba_KNN)
auc_KNN = round(roc_auc_score(y_test, y_pred_proba_KNN),4)
y_pred_proba_SVM=SVM_RBF.predict_proba(X_test)[::,-1]
fpr_SVM, tpr_SVM, _ = roc_curve(y_test, y_pred_proba_SVM)
auc_SVM = round(roc_auc_score(y_test, y_pred_proba_SVM),4)
plt.plot(fpr_RForest,tpr_RForest,label="R-Forest, auc="+str(auc_RForest))
plt.plot(fpr_Logistic,tpr_Logistic,label="Logistic, auc="+str(auc_logistic))
plt.plot(fpr_KNN,tpr_KNN,label="KNN, auc="+str(auc_KNN))
plt.plot(fpr_DTree,tpr_DTree,label="D-Tree, auc="+str(auc_DTree))
plt.plot(fpr_SVM,tpr_SVM,label="SVM, auc="+str(auc_SVM))
plt.legend(loc=4)
plt.show()
Accuracy_List= [Best_Accuracy_Logistic,Best_Accuracy_BayesianLogistic,Best_Accuracy_DTree,Best_Accuracy_RForest,Best_Accuracy_KNN,Best_Accuracy_SVM]
Algorithms_Name= ["Logistic Regression", "Bayesian Logistic Regression", "Decision Tree", "Random Forest", "KNN", "SVM"]
Accuracy_df=pd.DataFrame(Algorithms_Name, columns=["Algorithms"])
Accuracy_df["Accuracy"] = Accuracy_List
Accuracy_df
Accuracy_df.dtypes
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
plt.figure(figsize=(16,5))
ax = sns.barplot(x="Algorithms", y="Accuracy", data=Accuracy_df)
ax.yaxis.set_ticks(np.arange(0.0, 1.0, 0.05))
From both methods used above, we concluded that Random Forest works best for our data.
Many music streaming services rely primarily on relational algorithms, which simply suggest songs from similar users' playlists. Often, these recommendations are not indicative of the kinds of music the user actually enjoys. The objective of the second part of this project was to design a personalized song recommender system based on a song's audio features.
For our analysis we used the music dataset (songs from 1991 to 2018) containing each song's audio features to identify song themes using k-means clustering. We also had access to a user dataset containing 30 users' likes/dislikes on selected songs. Since there was no direct link between these two datasets, we used the Mahalanobis distance to link them and obtain a theme for each song in the user dataset. Finally, we designed a recommender system that recommends songs to a user when a selected song falls into their preferred theme.
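The linking step can be sketched as follows; the feature values, theme labels, and the single user-dataset song below are made up purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Toy sketch of the linking step: for one song from the user dataset, find
# the nearest song in the music dataset (Mahalanobis distance in feature
# space) and inherit its theme. All values here are hypothetical.
music_feats = np.array([[0.2, 0.8],
                        [0.9, 0.2],
                        [0.5, 0.6]])
themes = np.array([0, 1, 2])              # hypothetical theme label per song
user_song = np.array([0.85, 0.25])

VI = np.linalg.inv(np.cov(music_feats, rowvar=False))   # inverse covariance
dists = [mahalanobis(user_song, row, VI) for row in music_feats]
theme = themes[int(np.argmin(dists))]
print(theme)   # the second row is nearest, so its theme is inherited
```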
Unsupervised Learning Approach
Image(filename='Part_2_Approach.PNG')
Music files from 1991 to 2018 were combined into a single file called "MergedMusic". Basic data cleaning was performed to remove irrelevant columns and null values for the purpose of this analysis.
#importing music data (merged 28 files)
MergedMusic = pd.read_csv(r'MergedMusic.csv', encoding='ISO-8859-1')
#viewing data for understanding purposes
MergedMusic.head()
There are 26 columns and 265,120 rows in the "MergedMusic" dataset. "preview_url" was removed because it had 35,105 null values and adds no value to this analysis. Four more columns ("danceability", "speechiness", "valence" and "time_signature") had one null value each, so it made more sense to drop those rows from the "MergedMusic" dataset.
MergedMusic.shape
MergedMusic.isnull().sum()
plt.figure(figsize=(16,7))
sns.heatmap(MergedMusic.isnull(),cmap='viridis',cbar=False)
After cleaning, there are no null values left in the dataset, as the summary below confirms. The cleaned DataFrame is called "MergedMusic_clean".
#keeping relevant columns; the dropped URL/metadata columns add no value to this analysis
MergedMusic_clean = MergedMusic.copy()  # copy so the original DataFrame is not mutated by drop(inplace=True)
drop_col = ['analysis_url', 'preview_url', 'track_href', 'time_signature']
MergedMusic_clean.drop(drop_col, inplace = True, axis =1)
MergedMusic_clean = MergedMusic_clean.dropna()
MergedMusic_clean.isnull().sum()
The four scenarios below were considered to find the one with the most appropriate clusters and features for the personalized recommender system. Scenario 4 (selected features without PCA) was used for the recommender system. Features such as tempo, key, loudness, mode, instrumentalness, and speechiness were removed because they did not uniquely identify the clusters shown below.
Image(filename='Part_2_Radarchart.PNG')
Because k-means clustering relies on distance, the scale of the data affects the results: a feature measured on a larger scale would dominate the clusters. The features were therefore standardized so that no single feature dominates the clustering.
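A tiny sketch of why scaling matters (the tempo and valence values below are illustrative): on raw values the tempo gap dominates the Euclidean distance, while after standardization each feature has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three songs differing slightly in tempo (BPM, roughly 0-250) and strongly
# in valence (0-1). Raw Euclidean distance is dominated by the tempo gap.
X = np.array([[120.0, 0.10],   # [tempo, valence]
              [128.0, 0.90],
              [60.0,  0.50]])
raw_gap = np.linalg.norm(X[0] - X[1])   # sqrt(8**2 + 0.8**2) ~ 8.04, mostly tempo
print(round(raw_gap, 2))

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```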
Scaling: All Audio Features
A new DataFrame called "df_cluster_all" is created, consisting of all the audio features. These features were then scaled with StandardScaler, which transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
#All audio features
cluster_features_all = ['acousticness','danceability','energy','instrumentalness','key','liveness','loudness','mode',
'speechiness','tempo','valence']
df_cluster_all = MergedMusic_clean[cluster_features_all]
df_cluster_all_scale = np.array(df_cluster_all)
scaler = StandardScaler()
scaler.fit(df_cluster_all_scale)
df_cluster_all_scale = scaler.transform(df_cluster_all_scale)
Scaling: Selected Audio Features
A new DataFrame called "df_cluster_selected" is created, consisting of the selected audio features. These features were then scaled with StandardScaler, which transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
#Selected audio features
cluster_features_selected = ['acousticness','danceability','energy','liveness','valence']
df_cluster_selected = MergedMusic_clean[cluster_features_selected]
df_cluster_selected_scale = np.array(df_cluster_selected)
scaler = StandardScaler()
scaler.fit(df_cluster_selected_scale)
df_cluster_selected_scale = scaler.transform(df_cluster_selected_scale)
Defining Radar chart code. Reference: https://python-graph-gallery.com/392-use-faceting-for-radar-chart/
def make_radar(row, title, color, dframe, num_clusters):
    # number of variables
    categories = list(dframe)[1:]
    N = len(categories)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]
    # Initialise the radar plot
    ax = plt.subplot(2, ceil(num_clusters/2), row+1, polar=True)
    # Put the first axis on top
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)
    # Draw one axis per variable and add labels
    plt.xticks(angles[:-1], categories, color='grey', size=14)
    # Draw ylabels
    ax.set_rlabel_position(0)
    plt.yticks([0.2,0.4,0.6,0.8], ["0.2","0.4","0.6","0.8"], color="grey", size=8)
    plt.ylim(0,1)
    # Plot this cluster's feature averages
    values = dframe.loc[row].drop('cluster').values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
    ax.fill(angles, values, color=color, alpha=0.4)
    # Add a title
    plt.title(title, size=16, color=color, y=1.06)
K-means clustering is used in this unsupervised learning approach to design a personalized song recommender system. K-means requires the number of clusters (k) to be specified explicitly. One approach is to set k based on domain knowledge; another is to use the elbow method.
For each value of k, the squared distance from every point to its assigned cluster centroid is computed and summed, giving the sum of squared distances (SSD). SSD measures how close the points are to their cluster centroids, so the smaller the SSD, the tighter the clusters.
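As a sanity check on this definition, scikit-learn's KMeans exposes the SSD as inertia_, and it can be recomputed by hand on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Verify that KMeans.inertia_ equals the sum of squared distances from each
# point to its assigned cluster centroid, on random 2-D toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

ssd = sum(np.sum((x - km.cluster_centers_[lbl]) ** 2)
          for x, lbl in zip(X, km.labels_))
print(round(ssd, 6), round(km.inertia_, 6))
```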
Elbow Method: All Audio Features
ss_dist = []
K = range(1,20)
for k in K:
    km = KMeans(n_clusters=k, max_iter=10, init='k-means++', random_state=123)
    km = km.fit(df_cluster_all_scale)
    ss_dist.append(km.inertia_)
plt.plot(K, ss_dist, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Elbow Method: Selected Audio Features
ss_dist = []
K = range(1,20)
for k in K:
    km = KMeans(n_clusters=k, max_iter=10, init='k-means++', random_state=123)
    km = km.fit(df_cluster_selected_scale)
    ss_dist.append(km.inertia_)
plt.plot(K, ss_dist, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
From the elbow plots above, for both all and selected audio features, the optimal number of clusters is 4: by that point there is a considerable reduction in the total within-cluster sum of squares, with little further gain beyond it.
In the first scenario, all the features are considered. The value of k is specified as 4 based on the elbow analysis above.
num_clusters = 4
kmeanModel_s1 = KMeans(n_clusters=num_clusters, max_iter=100, init='k-means++', random_state=123).fit(df_cluster_all_scale)
Visualisation of cluster
Here all the features are considered and hence the number of components in PCA is taken as 11.
pca_s1 = PCA(n_components=11, random_state=123)
pca_results_s1 = pca_s1.fit_transform(df_cluster_all_scale)
print(pca_s1.explained_variance_ratio_.sum())
pca_s1.explained_variance_ratio_.cumsum()
df_scree_s1 = pd.DataFrame({'Component': ['1','2','3','4','5','6','7','8','9','10','11'],'Indiv':pca_s1.explained_variance_ratio_})
df_scree_s1['cum_sum'] = df_scree_s1['Indiv'].cumsum()
df_scree_s1
fig, ax = plt.subplots(figsize=(15, 10))
plt.bar(range(len(pca_s1.explained_variance_ratio_)), pca_s1.explained_variance_ratio_,
label='Individual', axes=ax, alpha=0.4)
plt.plot(range(len(pca_s1.explained_variance_ratio_)), pca_s1.explained_variance_ratio_.cumsum(),
label='Cumulative', color='tomato', axes=ax, marker='o')
ax.set_xticks(range(0,11))
ax.set_xticklabels(range(1,12), fontsize=12)
ax.set_yticklabels(range(0,90,10), fontsize=12)
plt.title('Plot of PCA - Scenario 1 (All feature without PCA)', fontsize=12)
plt.ylabel('Explained variance (%)', fontsize=12)
plt.xlabel('Principal components', fontsize=12)
plt.legend()
plt.show()
It can be seen from the above graph that 85% of the variance in data is explained by 6 components.
df_pca_s1 = pd.DataFrame(pca_results_s1)
df_pca_s1.columns = ['PC1', 'PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10','PC11']
df_pca_s1['label'] = kmeanModel_s1.labels_
df_pca_s1.head()
2-D Scatter Plot of Scenario 1
sns.set_style('white')
sns.scatterplot(data=df_pca_s1, x='PC1', y='PC2', hue='label', palette='Set1')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Visualisation of Songs')
plt.show()
The cluster plot above shows some overlap between the red and grey clusters. The clusters overlap even more when the value of k in the k-means algorithm is increased.
3-D Scatter Plot of Scenario 1
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(df_pca_s1['PC1'], df_pca_s1['PC2'], df_pca_s1['PC3'],
c=df_pca_s1['label'], cmap='Set1')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
Song feature distribution compared to overall distribution
MergedMusic_clean.loc[:,'cluster'] = kmeanModel_s1.labels_
# set binning intervals of 0.1
bins = np.linspace(0,1,10)
# create subplots
num_features = len(cluster_features_all)
f, axes = plt.subplots(num_clusters, num_features,
                       figsize=(20, 10), sharex='col')
# row index represents the cluster
row = 0
for cluster in np.sort(MergedMusic_clean['cluster'].unique()):
    # filter the rows belonging to this cluster and reset col for its 1st plot
    df_cluster = MergedMusic_clean[MergedMusic_clean['cluster'] == cluster]
    col = 0
    for feature in cluster_features_all:
        # binned counts for the overall dataframe and for this cluster
        rec_grp = MergedMusic_clean.groupby(pd.cut(MergedMusic_clean[feature], bins)).size().reset_index(name='count')
        cluster_grp = df_cluster.groupby(pd.cut(df_cluster[feature], bins)).size().reset_index(name='count')
        # plot the overall distribution and the cluster distribution on the ax
        sns.barplot(data=rec_grp, x=feature, y='count',
                    color='grey', ax=axes[row, col])
        sns.barplot(data=cluster_grp, x=feature, y='count',
                    color='red', ax=axes[row, col])
        # configure ax
        axes[row, col].set_xlabel('')
        axes[row, col].set_xticklabels(range(1,10), fontsize=12)
        if col > 0:
            axes[row, col].set_ylabel('')
        else:
            axes[row, col].set_ylabel('count', fontsize=12)
        if row == 0:
            axes[row, col].set_title(feature, fontsize=14)
        col += 1
    row += 1
f.suptitle('Profile for each cluster')
plt.show()
In the graph above, each row represents a cluster (0 to 3) and each column represents a feature. The grey bars show the overall distribution of the feature, giving a rough baseline; the red bars show the feature's distribution within that cluster, which can be compared against the other clusters.
Looking at each cluster's distributions, every cluster is high or low in certain features, which is indicated by whether the red bars sit to the right (high) or left (low) of the grey bars. From these characteristics, songs can be profiled into different themes.
Interpretation of each feature:
Acousticness: This feature uniquely identifies the clusters, as the graph above shows. It is low for clusters 0 and 3 and high for cluster 1; it has no effect on cluster 2.
Danceability: This feature uniquely identifies the clusters. It is high for cluster 0 and middling for clusters 1 and 3; it has very little effect on cluster 2.
Energy: This feature uniquely identifies the clusters. It is high for clusters 0 and 3 and low for cluster 1; it has no effect on cluster 2.
Instrumentalness: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Key: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Liveness: This feature uniquely identifies the clusters. It is low for clusters 0 and 1; it has no effect on cluster 2 and is spread out for cluster 3.
Loudness: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Mode: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Speechiness: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Tempo: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Valence: This feature uniquely identifies the clusters. It is high for cluster 0 and low for cluster 1; it has no effect on cluster 2 and is spread out for cluster 3.
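The high/low profiling above can be condensed numerically by comparing each cluster's feature mean with the overall mean; the toy values below are illustrative, not the real Spotify features:

```python
import pandas as pd

# Toy data: two features, two clusters; values are hypothetical.
df = pd.DataFrame({'energy':  [0.9, 0.8, 0.2, 0.3],
                   'valence': [0.8, 0.7, 0.1, 0.2],
                   'cluster': [0, 0, 1, 1]})

# Each cluster's mean minus the overall mean: positive entries mark "high"
# features for that cluster, negative entries mark "low" features.
profile = df.groupby('cluster')[['energy', 'valence']].mean() - df[['energy', 'valence']].mean()
print(profile.round(2))
```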
Average song features per cluster
It is easier to view the differences between all the clusters at a glance, so a radar chart of each cluster's average feature values was plotted.
# calculate mean of each variable
radar_col = cluster_features_all + ['cluster']
# feature average for each cluster as a radar chart
df_radar_all = MergedMusic_clean[radar_col]
df_radar_all = df_radar_all.groupby('cluster').mean().reset_index()
# initialize the figure
plt.figure(figsize=(24,15))
# Create a color palette:
my_palette = plt.cm.get_cmap("Set1", len(df_radar_all.index))
# Create cluster name
title_list = ['Chill Vibes', 'lyrical', 'Dance (Rock/Pop)', 'Folk']
# Loop to plot
for row in range(0, len(df_radar_all.index)):
    make_radar(row=row, title=str(df_radar_all['cluster'][row]+1) + ' : ' + title_list[row],
               color=my_palette(row), dframe=df_radar_all, num_clusters=len(df_radar_all.index))
# Show plot
plt.show()
The readings of the radar chart are consistent with the profiles given above.
Interpretation of each Cluster:
Cluster 0 (1: Chill Vibes):
High: Danceability, Energy, Key, Mode, Tempo, Valence.
Low: Acousticness, Instrumentalness, Liveness, Loudness, Speechiness.
Cluster 1 (2: Lyrical):
High: Acousticness, Key, Mode, Tempo.
Low: Danceability, Energy, Instrumentalness, Liveness, Loudness, Speechiness, Valence.
Cluster 2 (3: Dance (Rock/Pop)):
High: Acousticness, Danceability, Energy, Key, Liveness, Mode, Speechiness, Tempo, Valence.
Low: Instrumentalness, Loudness.
Cluster 3 (4: Folk):
High: Energy, Key, Mode, Tempo.
Low: Acousticness, Danceability, Instrumentalness, Liveness, Loudness, Speechiness, Valence.
#All audio features
cluster_features_all = ['acousticness','danceability','energy','instrumentalness','key','liveness','loudness','mode',
'speechiness','tempo','valence']
df_cluster_all = MergedMusic_clean[cluster_features_all]
df_cluster_all_scale = np.array(df_cluster_all)
scaler = StandardScaler()
scaler.fit(df_cluster_all_scale)
df_cluster_all_scale = scaler.transform(df_cluster_all_scale)
num_clusters = 4
kmeanModel_s2 = KMeans(n_clusters=num_clusters, max_iter=100, init='k-means++', random_state=123).fit(df_cluster_all_scale)
Visualisation of cluster
Here all the features are considered and the number of PCA components is taken as 6. This is based on scenario 1, where 6 components explained 85% of the variance.
pca_s2 = PCA(n_components=6, random_state=123)
pca_results_s2 = pca_s2.fit_transform(df_cluster_all_scale)
print(pca_s2.explained_variance_ratio_.sum())
pca_s2.explained_variance_ratio_.cumsum()
df_scree_s2 = pd.DataFrame({'Component': ['1','2','3','4','5','6'],'Indiv':pca_s2.explained_variance_ratio_})
df_scree_s2['cum_sum'] = df_scree_s2['Indiv'].cumsum()
df_scree_s2
fig, ax = plt.subplots(figsize=(15, 10))
plt.bar(range(len(pca_s2.explained_variance_ratio_)), pca_s2.explained_variance_ratio_,
label='Individual', axes=ax, alpha=0.4)
plt.plot(range(len(pca_s2.explained_variance_ratio_)), pca_s2.explained_variance_ratio_.cumsum(),
label='Cumulative', color='tomato', axes=ax, marker='o')
ax.set_xticks(range(0,6))
ax.set_xticklabels(range(1,7), fontsize=12)
ax.set_yticklabels(range(0,90,10), fontsize=12)
plt.title('Plot of PCA - Scenario 2 (All feature with PCA)', fontsize=12)
plt.ylabel('Explained variance (%)', fontsize=12)
plt.xlabel('Principal components', fontsize=12)
plt.legend()
plt.show()
It can be seen from the above graph that 77% of the variance in data is explained by 6 components.
df_pca_s2 = pd.DataFrame(pca_results_s2)
df_pca_s2.columns = ['PC1', 'PC2','PC3','PC4','PC5','PC6']
df_pca_s2['label'] = kmeanModel_s2.labels_
df_pca_s2.head()
2-D Scatter Plot of Scenario 2
sns.set_style('white')
sns.scatterplot(data=df_pca_s2, x='PC1', y='PC2', hue='label', palette='Set1')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Visualisation of Songs')
plt.show()
The cluster plot above shows some overlap between the red and grey clusters. The clusters overlap even more when the value of k in the k-means algorithm is increased.
3-D Scatter Plot of Scenario 2
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(df_pca_s2['PC1'], df_pca_s2['PC2'], df_pca_s2['PC3'],
c=df_pca_s2['label'], cmap='Set1')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
Song feature distribution compared to overall distribution
MergedMusic_clean.loc[:,'cluster'] = kmeanModel_s2.labels_
# set binning intervals of 0.1
bins = np.linspace(0,1,10)
# create subplots
num_features = len(cluster_features_all)
f, axes = plt.subplots(num_clusters, num_features,
                       figsize=(20, 10), sharex='col')
# row index represents the cluster
row = 0
for cluster in np.sort(MergedMusic_clean['cluster'].unique()):
    # filter the rows belonging to this cluster and reset col for its 1st plot
    df_cluster = MergedMusic_clean[MergedMusic_clean['cluster'] == cluster]
    col = 0
    for feature in cluster_features_all:
        # binned counts for the overall dataframe and for this cluster
        rec_grp = MergedMusic_clean.groupby(pd.cut(MergedMusic_clean[feature], bins)).size().reset_index(name='count')
        cluster_grp = df_cluster.groupby(pd.cut(df_cluster[feature], bins)).size().reset_index(name='count')
        # plot the overall distribution and the cluster distribution on the ax
        sns.barplot(data=rec_grp, x=feature, y='count',
                    color='grey', ax=axes[row, col])
        sns.barplot(data=cluster_grp, x=feature, y='count',
                    color='red', ax=axes[row, col])
        # configure ax
        axes[row, col].set_xlabel('')
        axes[row, col].set_xticklabels(range(1,10), fontsize=12)
        if col > 0:
            axes[row, col].set_ylabel('')
        else:
            axes[row, col].set_ylabel('count', fontsize=12)
        if row == 0:
            axes[row, col].set_title(feature, fontsize=14)
        col += 1
    row += 1
f.suptitle('Profile for each cluster')
plt.show()
In the graph above, each row represents a cluster (0 to 3) and each column represents a feature. The grey bars show the overall distribution of the feature, giving a rough baseline; the red bars show the feature's distribution within that cluster, which can be compared against the other clusters.
Looking at each cluster's distributions, every cluster is high or low in certain features, which is indicated by whether the red bars sit to the right (high) or left (low) of the grey bars. From these characteristics, songs can be profiled into different themes.
Interpretation of each feature:
Acousticness: This feature uniquely identifies the clusters, as the graph above shows. It is low for clusters 0 and 3 and high for cluster 1; it has no effect on cluster 2.
Danceability: This feature uniquely identifies the clusters. It is high for cluster 0 and middling for clusters 1 and 3; it has very little effect on cluster 2.
Energy: This feature uniquely identifies the clusters. It is high for clusters 0 and 3 and low for cluster 1; it has no effect on cluster 2.
Instrumentalness: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Key: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Liveness: This feature uniquely identifies the clusters. It is low for clusters 0 and 1; it has no effect on cluster 2 and is spread out for cluster 3.
Loudness: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Mode: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Speechiness: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Tempo: This feature does not uniquely identify the clusters; its distribution looks similar across all of them.
Valence: This feature uniquely identifies the clusters. It is high for cluster 0 and low for cluster 1; it has no effect on cluster 2 and is spread out for cluster 3.
Average song features per cluster
# calculate mean of each variable
radar_col_PCA = cluster_features_all + ['cluster']
# feature average for each cluster as a radar chart
df_radar_all_PCA = MergedMusic_clean[radar_col_PCA]
df_radar_all_PCA = df_radar_all_PCA.groupby('cluster').mean().reset_index()
# initialize the figure
plt.figure(figsize=(24,15))
# Create a color palette:
my_palette = plt.cm.get_cmap("Set1", len(df_radar_all_PCA.index))
# Create cluster name
title_list = ['Chill Vibes', 'Lyrical', 'Dance (Rock/Pop)', 'Folk']
# Loop to plot
for row in range(0, len(df_radar_all_PCA.index)):
make_radar(row=row, title=str(df_radar_all_PCA['cluster'][row]+1) + ' : ' + title_list[row],
color=my_palette(row), dframe=df_radar_all_PCA, num_clusters=len(df_radar_all_PCA.index))
# Show plot
plt.show()
The readings of the radar charts are consistent with the profiles given above.
Interpretation of each Cluster:
Cluster 0 (1: Chill Vibes):
High: Danceability, Energy, Key, Mode, Tempo, Valence.
Low: Acousticness, Instrumentalness, Liveness, Loudness, Speechiness.
Cluster 1 (2: Lyrical):
High: Acousticness, Key, Mode, Tempo.
Low: Danceability, Energy, Instrumentalness, Liveness, Loudness, Speechiness, Valence.
Cluster 2 (3: Dance (Rock/Pop)):
High: Acousticness, Danceability, Energy, Key, Liveness, Mode, Speechiness, Tempo, Valence.
Low: Instrumentalness, Loudness.
Cluster 3 (4: Folk):
High: Energy, Key, Mode, Tempo.
Low: Acousticness, Danceability, Instrumentalness, Liveness, Loudness, Speechiness, Valence.
#Selected audio features
cluster_features_selected = ['acousticness','danceability','energy','liveness','valence']
df_cluster_selected = MergedMusic_clean[cluster_features_selected]
df_cluster_selected_scale = np.array(df_cluster_selected)
scaler = StandardScaler()
scaler.fit(df_cluster_selected_scale)
df_cluster_selected_scale = scaler.transform(df_cluster_selected_scale)
num_clusters = 4
kmeanModel_s3 = KMeans(n_clusters=num_clusters, max_iter=100, init='k-means++', random_state=123).fit(df_cluster_selected_scale)
Visualisation of cluster
Here the selected features are used and the number of PCA components is set to 3.
pca_s3 = PCA(n_components=3, random_state=123)
pca_results_s3 = pca_s3.fit_transform(df_cluster_selected_scale)
print(pca_s3.explained_variance_ratio_.sum())
pca_s3.explained_variance_ratio_.cumsum()
df_scree_s3 = pd.DataFrame({'Component': ['1','2','3'],'Indiv':pca_s3.explained_variance_ratio_})
df_scree_s3['cum_sum'] = df_scree_s3['Indiv'].cumsum()
df_scree_s3
fig, ax = plt.subplots(figsize=(15, 10))
var_pct = pca_s3.explained_variance_ratio_ * 100  # plot as percentages
ax.bar(range(len(var_pct)), var_pct, label='Individual', alpha=0.4)
ax.plot(range(len(var_pct)), var_pct.cumsum(), label='Cumulative', color='tomato', marker='o')
ax.set_xticks(range(0, 3))
ax.set_xticklabels(range(1, 4), fontsize=12)
ax.tick_params(axis='y', labelsize=12)
plt.title('Plot of PCA - Scenario 3 (Selected feature with PCA)', fontsize=12)
plt.ylabel('Explained variance (%)', fontsize=12)
plt.xlabel('Principal components', fontsize=12)
plt.legend()
plt.show()
It can be seen from the above graph that 85% of the variance in data is explained by 3 components.
df_pca_s3 = pd.DataFrame(pca_results_s3)
df_pca_s3.columns = ['PC1', 'PC2','PC3']
df_pca_s3['label'] = kmeanModel_s3.labels_
df_pca_s3.head()
2-D Scatter Plot of Scenario 3
sns.set_style('white')
sns.scatterplot(data=df_pca_s3, x='PC1', y='PC2', hue='label', palette='Set1')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Visualisation of Songs')
plt.show()
It can be seen from the above cluster plot that the overlap between clusters has largely disappeared; the clusters now appear more distinct.
3-D Scatter Plot of Scenario 3
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3-D projection
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(df_pca_s3['PC1'], df_pca_s3['PC2'], df_pca_s3['PC3'],
c=df_pca_s3['label'], cmap='Set1')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
Song feature distribution compared to overall distribution
MergedMusic_clean.loc[:,'cluster'] = kmeanModel_s3.labels_
# ten bin edges -> nine equal-width bins across [0, 1]
bins = np.linspace(0, 1, 10)
# create subplots
num_features = len(cluster_features_selected)
f, axes = plt.subplots(num_clusters, num_features,
figsize=(20, 10), sharex='col')
# initialise 1st row. Representing cluster
row = 0
for cluster in np.sort(MergedMusic_clean['cluster'].unique()):
# filter cluster dataframe and reset column to 0 for 1st plot for the cluster
df_cluster_selected_scale = MergedMusic_clean[MergedMusic_clean['cluster'] == cluster]
col = 0
for feature in cluster_features_selected:
# create binned count for all recent df and clustered df
rec_grp = MergedMusic_clean.groupby(pd.cut(MergedMusic_clean[feature], bins)).size().reset_index(name='count')
cluster_grp = df_cluster_selected_scale.groupby(pd.cut(df_cluster_selected_scale[feature], bins)).size().reset_index(name='count')
# plot overall distribution and cluster distribution on the ax
sns.barplot(data=rec_grp, x=feature, y='count',
color='grey', ax=axes[row, col])
sns.barplot(data=cluster_grp, x=feature, y='count',
color='red', ax=axes[row, col])
# configure ax
axes[row, col].set_xlabel('')
axes[row, col].set_xticklabels(range(1,10), fontsize=12)
if col > 0:
axes[row, col].set_ylabel('')
else:
axes[row, col].set_ylabel('count', fontsize=12)
if row == 0:
axes[row, col].set_title(feature, fontsize=14)
col += 1
row += 1
f.suptitle('Profile of each cluster')
plt.show()
In the graph above, each row represents a cluster (0 to 3) and each column represents a feature. The grey bars show the overall distribution of the feature, which gives a rough idea of its shape; the red bars show the distribution of the feature within that cluster, which can be compared against the other clusters.
Looking at the distribution of each cluster, it can be seen that each cluster is high or low in certain features: the red bars sit to the right (high) or to the left (low) of the grey bars. From these characteristics, songs can be profiled into different themes.
Interpretation of each feature:
Acousticness: This feature distinguishes the clusters, as can be seen in the graph above. It is low for clusters 0 and 3 and high for cluster 1; it has no clear effect on cluster 2.
Danceability: This feature distinguishes the clusters. It is high for cluster 0 and in the middle for clusters 1 and 3; it has very little effect on cluster 2.
Energy: This feature distinguishes the clusters. It is high for cluster 0 and low for cluster 1; it has no clear effect on cluster 2.
Liveness: This feature distinguishes the clusters. It is low for clusters 0 and 1, has no clear effect on cluster 2, and is spread across the range for cluster 3.
Valence: This feature distinguishes the clusters. It is high for cluster 0 and low for cluster 1, has no clear effect on cluster 2, and is spread across the range for cluster 3.
Average song features per cluster
# calculate mean of each variable
radar_col_selected_PCA = cluster_features_selected + ['cluster']
# feature average for each cluster as a radar chart
df_radar_selected_PCA = MergedMusic_clean[radar_col_selected_PCA]
df_radar_selected_PCA = df_radar_selected_PCA.groupby('cluster').mean().reset_index()
# initialize the figure
plt.figure(figsize=(24,15))
# Create a color palette:
my_palette = plt.cm.get_cmap("Set1", len(df_radar_selected_PCA.index))
# Create cluster name
title_list = ['Chill Vibes', 'Lyrical', 'Dance (Rock/Pop)', 'Folk']
# Loop to plot
for row in range(0, len(df_radar_selected_PCA.index)):
make_radar(row=row, title=str(df_radar_selected_PCA['cluster'][row]+1) + ' : ' + title_list[row],
color=my_palette(row), dframe=df_radar_selected_PCA, num_clusters=len(df_radar_selected_PCA.index))
# Show plot
plt.show()
The readings of the radar charts are consistent with the profiles given above.
Interpretation of each Cluster:
Cluster 0 (1: Chill Vibes):
High: Danceability, Energy, Valence.
Low: Acousticness, Liveness.
Cluster 1 (2: Lyrical):
High: Loudness, Energy.
Low: Acousticness, Danceability, Valence.
Cluster 2 (3: Dance (Rock/Pop)):
High: Danceability, Energy.
Low: Acousticness, Liveness, Valence.
Cluster 3 (4: Folk):
High: Acousticness.
Low: Danceability, Energy, Liveness, Valence.
#Selected audio features
cluster_features_selected = ['acousticness','danceability','energy','liveness','valence']
df_cluster_selected = MergedMusic_clean[cluster_features_selected]
df_cluster_selected_scale = np.array(df_cluster_selected)
scaler = StandardScaler()
scaler.fit(df_cluster_selected_scale)
df_cluster_selected_scale = scaler.transform(df_cluster_selected_scale)
num_clusters = 4
kmeanModel_s4 = KMeans(n_clusters=num_clusters, max_iter=100, init='k-means++', random_state=123).fit(df_cluster_selected_scale)
Visualisation of cluster
Here the selected features are used without any PCA reduction; the number of components is therefore set to 5, i.e. all of the selected features are retained.
pca_s4 = PCA(n_components=5, random_state=123)
pca_results_s4 = pca_s4.fit_transform(df_cluster_selected_scale)
print(pca_s4.explained_variance_ratio_.sum())
pca_s4.explained_variance_ratio_.cumsum()
df_scree_s4 = pd.DataFrame({'Component': ['1','2','3','4','5'],'Indiv':pca_s4.explained_variance_ratio_})
df_scree_s4['cum_sum'] = df_scree_s4['Indiv'].cumsum()
df_scree_s4
fig, ax = plt.subplots(figsize=(15, 10))
var_pct = pca_s4.explained_variance_ratio_ * 100  # plot as percentages
ax.bar(range(len(var_pct)), var_pct, label='Individual', alpha=0.4)
ax.plot(range(len(var_pct)), var_pct.cumsum(), label='Cumulative', color='tomato', marker='o')
ax.set_xticks(range(0, 5))
ax.set_xticklabels(range(1, 6), fontsize=12)
ax.tick_params(axis='y', labelsize=12)
plt.title('Plot of PCA - Scenario 4 (Selected feature without PCA)', fontsize=12)
plt.ylabel('Explained variance (%)', fontsize=12)
plt.xlabel('Principal components', fontsize=12)
plt.legend()
plt.show()
It can be seen from the above graph that 95% of the variance in data is explained by 4 components.
df_pca_s4 = pd.DataFrame(pca_results_s4)
df_pca_s4.columns = ['PC1', 'PC2','PC3','PC4','PC5']
df_pca_s4['label'] = kmeanModel_s4.labels_
df_pca_s4.head()
2-D Scatter Plot of Scenario 4
sns.set_style('white')
sns.scatterplot(data=df_pca_s4, x='PC1', y='PC2', hue='label', palette='Set1')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Visualisation of Songs')
plt.show()
It can be seen from the above cluster plot that the overlap between clusters has largely disappeared; the clusters now appear more distinct.
3-D Scatter Plot of Scenario 4
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3-D projection
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(df_pca_s4['PC1'], df_pca_s4['PC2'], df_pca_s4['PC3'],
c=df_pca_s4['label'], cmap='Set1')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
Song feature distribution compared to overall distribution
MergedMusic_clean.loc[:,'cluster'] = kmeanModel_s4.labels_
# ten bin edges -> nine equal-width bins across [0, 1]
bins = np.linspace(0, 1, 10)
# create subplots
num_features = len(cluster_features_selected)
f, axes = plt.subplots(num_clusters, num_features,
figsize=(20, 10), sharex='col')
# initialise 1st row. Representing cluster
row = 0
for cluster in np.sort(MergedMusic_clean['cluster'].unique()):
# filter cluster dataframe and reset column to 0 for 1st plot for the cluster
df_cluster_selected_scale = MergedMusic_clean[MergedMusic_clean['cluster'] == cluster]
col = 0
for feature in cluster_features_selected:
# create binned count for all recent df and clustered df
rec_grp = MergedMusic_clean.groupby(pd.cut(MergedMusic_clean[feature], bins)).size().reset_index(name='count')
cluster_grp = df_cluster_selected_scale.groupby(pd.cut(df_cluster_selected_scale[feature], bins)).size().reset_index(name='count')
# plot overall distribution and cluster distribution on the ax
sns.barplot(data=rec_grp, x=feature, y='count',
color='grey', ax=axes[row, col])
sns.barplot(data=cluster_grp, x=feature, y='count',
color='red', ax=axes[row, col])
# configure ax
axes[row, col].set_xlabel('')
axes[row, col].set_xticklabels(range(1,10), fontsize=12)
if col > 0:
axes[row, col].set_ylabel('')
else:
axes[row, col].set_ylabel('count', fontsize=12)
if row == 0:
axes[row, col].set_title(feature, fontsize=14)
col += 1
row += 1
f.suptitle('Profile of each cluster')
plt.show()
In the graph above, each row represents a cluster (0 to 3) and each column represents a feature. The grey bars show the overall distribution of the feature, which gives a rough idea of its shape; the red bars show the distribution of the feature within that cluster, which can be compared against the other clusters.
Looking at the distribution of each cluster, it can be seen that each cluster is high or low in certain features: the red bars sit to the right (high) or to the left (low) of the grey bars. From these characteristics, songs can be profiled into different themes.
Interpretation of each feature:
Acousticness: This feature distinguishes the clusters, as can be seen in the graph above. It is low for clusters 0 and 3 and high for cluster 1; it has no clear effect on cluster 2.
Danceability: This feature distinguishes the clusters. It is high for cluster 0 and in the middle for clusters 1 and 3; it has very little effect on cluster 2.
Energy: This feature distinguishes the clusters. It is high for cluster 0 and low for cluster 1; it has no clear effect on cluster 2.
Liveness: This feature distinguishes the clusters. It is low for clusters 0 and 1, has no clear effect on cluster 2, and is spread across the range for cluster 3.
Valence: This feature distinguishes the clusters. It is high for cluster 0 and low for cluster 1, has no clear effect on cluster 2, and is spread across the range for cluster 3.
Average song features per cluster
# calculate mean of each variable
radar_col_selected = cluster_features_selected + ['cluster']
# feature average for each cluster as a radar chart
df_radar_selected = MergedMusic_clean[radar_col_selected]
df_radar_selected = df_radar_selected.groupby('cluster').mean().reset_index()
# initialize the figure
plt.figure(figsize=(24,15))
# Create a color palette:
my_palette = plt.cm.get_cmap("Set1", len(df_radar_selected.index))
# Create cluster name
title_list = ['Chill Vibes', 'Lyrical', 'Dance (Rock/Pop)', 'Folk']
# Loop to plot
for row in range(0, len(df_radar_selected.index)):
make_radar(row=row, title=str(df_radar_selected['cluster'][row]+1) + ' : ' + title_list[row],
color=my_palette(row), dframe=df_radar_selected, num_clusters=len(df_radar_selected.index))
# Show plot
plt.show()
The readings of the radar charts are consistent with the profiles given above.
Interpretation of each Cluster:
Cluster 0 (1: Chill Vibes):
High: Danceability, Energy, Valence.
Low: Acousticness, Liveness.
Cluster 1 (2: Lyrical):
High: Loudness, Energy.
Low: Acousticness, Danceability, Valence.
Cluster 2 (3: Dance (Rock/Pop)):
High: Danceability, Energy.
Low: Acousticness, Liveness, Valence.
Cluster 3 (4: Folk):
High: Acousticness.
Low: Danceability, Energy, Liveness, Valence.
Based on the above four scenarios, the last scenario appears to give the most appropriate clusters for the purpose of this project.
Cosine similarity looks at the angle between vectors. It is generally used when the magnitude of the vectors does not matter. For example, consider text data represented by word counts. If a word (e.g. "analytics") occurs more often in document 1 than in document 2, we might assume document 1 is more related to analytics, when in fact document 1 may simply be longer than document 2, so the word appears more times.
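To make this concrete, here is a minimal sketch (with made-up word-count vectors, not project data) showing that cosine similarity ignores document length:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (magnitude-independent)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word counts: doc2 has the same proportions as doc1,
# just ten times the length, so their directions are identical.
doc1 = [3, 1, 0]    # counts of e.g. ['analytics', 'music', 'sport']
doc2 = [30, 10, 0]
print(cosine_similarity(doc1, doc2))  # 1.0 -- magnitude does not matter
```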
Mahalanobis distance is a measure of the distance between a point P and a distribution D. It is a multi-dimensional generalisation of measuring how many standard deviations away P is from the mean of D.
Mahalanobis distance was used to calculate similarity because it accounts for the correlations between variables and normalises each variable by its variance, so no single variable dominates.
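As an illustration (on synthetic correlated data, not the project's clusters), the distance between a point and the centre of a distribution can be computed with `scipy.spatial.distance.mahalanobis`, which takes the inverse covariance matrix:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Synthetic 2-D sample with correlated columns (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix

# Distance of a point from the distribution, in correlation-adjusted
# "standard deviation" units.
point = np.array([1.0, 1.0])
d = mahalanobis(point, mean, inv_cov)
print(d)
```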
Themes were assigned to each song in the original MergedMusic data, and labels for the four themes were created by listening to the songs in each cluster.
#Getting music file with clusters
MergedMusic_clustered = MergedMusic_clean.join(df_pca_s4, how='outer')
MergedMusic_clustered['cluster'] += 1
MergedMusic_clustered.loc[MergedMusic_clustered['label']== 0,['label']] = '1: Chill Vibes'
MergedMusic_clustered.loc[MergedMusic_clustered['label']== 1,['label']] = '2: Lyrical'
MergedMusic_clustered.loc[MergedMusic_clustered['label']== 2,['label']] = '3: Dance (Rock/Pop)'
MergedMusic_clustered.loc[MergedMusic_clustered['label']== 3,['label']] = '4: Folk'
MergedMusic_clustered.head()
df_Music = df_radar_selected.copy()  # copy so the radar dataframe is not modified
The cluster column was dropped to get a vector of average features per cluster only.
if 'cluster' in df_Music.columns:
del df_Music['cluster']
df_vector1 below is a list containing the average feature vector of each cluster.
df_vector1 = df_Music[['acousticness', 'danceability', 'energy', 'liveness', 'valence']].values.tolist()
df_vector1
Mahalanobis distance requires the inverse covariance matrix of the features to be calculated.
#Calculate covariance matrix
covmx = df_Music.cov()
invcovmx = sp.linalg.inv(covmx)
Training and Testing set
User music data was split randomly into a training set (70%) and a test set (30%) for each user.
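The split itself was produced outside this notebook; for reference, a per-user 70/30 split could be generated with `train_test_split`. This is a sketch on a toy table whose column names mirror the ones used below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical user-listening table (illustrative data only).
user_music = pd.DataFrame({
    'user_id': [1] * 10 + [2] * 10,
    'id': range(20),
    'user_response': [1, 0] * 10,
})

# Split each user's rows 70/30 so every user appears in both sets.
train_parts, test_parts = [], []
for _, grp in user_music.groupby('user_id'):
    tr, te = train_test_split(grp, test_size=0.3, random_state=123)
    train_parts.append(tr)
    test_parts.append(te)

UserMusic_Train = pd.concat(train_parts).reset_index(drop=True)
UserMusic_Test = pd.concat(test_parts).reset_index(drop=True)
print(len(UserMusic_Train), len(UserMusic_Test))  # 14 6
```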
#importing user data
UserMusic_Train = pd.read_csv(r'UserMusic_Train.csv', encoding='ISO-8859-1')
UserMusic_Test = pd.read_csv(r'UserMusic_Test.csv', encoding='ISO-8859-1')
UserMusic_Train.head(2)
UserMusic = UserMusic_Train[['acousticness', 'danceability', 'energy', 'liveness', 'valence']]
UserMusic_Train['cluster'] = np.nan
df_vector3 = UserMusic[['acousticness', 'danceability', 'energy', 'liveness', 'valence']].values.tolist()
For each song in the user data, Mahalanobis distance was used to find the closest theme.
list11 = []
for j in range(len(df_vector3)):
for i in range(4):
list11.append(mahalanobis(df_vector1[i], df_vector3[j], invcovmx))
# assign the closest theme (1-indexed); .loc avoids chained-assignment issues
UserMusic_Train.loc[j, 'cluster'] = list11.index(min(list11)) + 1
list11 = []
print(UserMusic_Train.head())
We calculated the probability of a particular user liking a theme, computed as the number of liked songs divided by the total number of songs that user has listened to in that theme.
UserMusicGrouped_train = pd.DataFrame(UserMusic_Train.groupby(['user_id','cluster','user_response'], as_index = False)['id'].count())
UserMusicGrouped2_train = pd.DataFrame(UserMusic_Train.groupby(['user_id','cluster',], as_index = False)['id'].count())
print(UserMusicGrouped_train.head())
prob = []
for i in range(30):
for j in range(4):
t1 =(UserMusicGrouped_train[(UserMusicGrouped_train['user_id'] == (i+1)) & (UserMusicGrouped_train['cluster'] == (j+1)) & (UserMusicGrouped_train['user_response'] == 1)]['id']).values
t2 = (UserMusicGrouped_train[(UserMusicGrouped_train['user_id'] == (i+1)) & (UserMusicGrouped_train['cluster'] == (j+1)) & (UserMusicGrouped_train['user_response'] == 0)]['id']).values
# guard against users with no liked (or no disliked) songs in a theme
liked = t1[0] if len(t1) > 0 else 0
disliked = t2[0] if len(t2) > 0 else 0
total = liked + disliked
prob = [round(liked/total, 2)] if total > 0 else [0]
UserMusicGrouped2_train['id'] = UserMusicGrouped2_train['id'].astype(object)
UserMusicGrouped2_train.loc[(UserMusicGrouped2_train['user_id'] == (i+1)) & (UserMusicGrouped2_train['cluster'] == (j+1)), ['id']]= prob[0]
t1 = 0
t2 = 0
UserMusicGrouped2_train.head()
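For comparison, the same per-user, per-theme like-probability can be computed more compactly: since `user_response` is 0/1, its group mean is exactly the fraction of liked songs. A sketch on toy data with the same column names:

```python
import pandas as pd

# Toy listening history mirroring UserMusic_Train's columns.
df = pd.DataFrame({
    'user_id':       [1, 1, 1, 1, 2, 2],
    'cluster':       [1, 1, 2, 2, 1, 1],
    'user_response': [1, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 responses per (user, cluster) IS the like-probability.
like_prob = (df.groupby(['user_id', 'cluster'])['user_response']
               .mean().round(2).reset_index(name='prob'))
print(like_prob)
```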
The objective of any recommender system is to be as accurate as possible when recommending a song to a user. Here the objective was to see how well our recommender performs on unseen data.
UserMusic_Test['predicted_response'] = np.nan
for i in range(len(UserMusic_Test)):
if (UserMusicGrouped2_train[(UserMusicGrouped2_train['user_id'] == UserMusic_Test.iloc[i]['user_id']) &
(UserMusicGrouped2_train['cluster'] == UserMusic_Test.iloc[i]['cluster'])]['id']).values > 0.5:
UserMusic_Test.loc[i,'predicted_response'] = 1
else:
UserMusic_Test.loc[i,'predicted_response'] = 0
UserMusic_Test.head()
Total_Test = len(UserMusic_Test['user_id'])
Correct_Prediction = 0
for i in range(len(UserMusic_Test)):
if UserMusic_Test.loc[i,'predicted_response'] == UserMusic_Test.loc[i,'user_response']:
Correct_Prediction = Correct_Prediction + 1
Accuracy = (Correct_Prediction/Total_Test)*100
print("The accuracy of the recommender system is " + str(round(Accuracy, 2)) + "%.")
It can be seen from the above that our personalised recommender system's accuracy is about 74%. This is not bad for a basic recommender system, and it could be further improved by adding user-based recommendation as an add-on.
The recommender system can tell whether a particular song should be recommended to a user based on their taste. Secondly, it can provide the top 5 songs from a particular theme to a user.
Recommendation by Song
while True:
try:
User_ID = int(input('Please enter an user id:'))
break
except:
print("Please enter a whole number!")
while True:
try:
Song_name = input('Please enter a song name:')
break
except:
print("Please enter a valid song name!")
Cluster_ID = (MergedMusic_clustered[MergedMusic_clustered['song_name'] == Song_name]['cluster']).values
if (UserMusicGrouped2_train[(UserMusicGrouped2_train['user_id'] == User_ID) & (UserMusicGrouped2_train['cluster'] == Cluster_ID[0])]['id']).values > 0.5:
print('This song is recommended!')
else:
print('This song is not recommended!')
#Beware the Bull (1991)
Recommendation by Theme
while True:
try:
User_ID = int(input('Please enter an user id:'))
break
except:
print("Please enter a whole number!")
while True:
try:
Theme_name = input('Please enter a Theme:')
break
except:
print("Please enter a valid theme!")
#1: Chill Vibes (use this value as an example)
Top5songs = pd.DataFrame(MergedMusic_clustered[MergedMusic_clustered["label"]==Theme_name].nlargest(5, "popularity")["song_name"])
print("Top 5 songs in " + Theme_name + " for user " + str(User_ID) + ":")
print(Top5songs)
Conclusion:
- For the supervised learning objective, predicting the popularity of a song based on its audio features, Random Forest outperformed all the other models.
- For unsupervised learning, our personalized recommender system was able to predict whether a particular song or theme should be recommended to a user, based on his/her likes.